An introduction to modern test theory
2023-12-06

library(lavaan)      # provides cfa() and the HolzingerSwineford1939 dataset
library(lavaanPlot)  # path diagram of the fitted model

HS.model <- ' visual  =~ x1 + x2 + x3
              textual =~ x4 + x5 + x6
              speed   =~ x7 + x8 + x9 '
fit <- cfa(HS.model, data = HolzingerSwineford1939)
lavaanPlot(model = fit)
We want to measure something that is not directly accessible by using one or more proxy indicators.
Based on observed indicators (response data collected from participants), the latent variable is assigned a value for each respondent/participant. The value of the latent variable is the measurement. The indicators themselves are not measurements, they are indicators of a latent variable.
You will create a brief questionnaire to measure a unidimensional latent variable of your choice.
We’ll primarily look at the first two types, as they are most commonly used in psychology and psychometrics.
Johansson, M., Preuter, M., Karlsson, S., Möllerberg, M.-L., Svensson, H., & Melin, J. (2023). Valid and Reliable? Basic and Expanded Recommendations for Psychometric Reporting and Quality Assessment. OSF Preprints. https://doi.org/10.31219/osf.io/3htzc
We need a reasonable psychometric analysis to justify the use of a sum score! And even then the “sum score” is a debatable metric in itself since it is ordinal data, but often gets treated like interval data in statistical models.
We will look at each of these in more detail during this lecture.
| Criterion | Description |
|---|---|
| Unidimensionality | Items represent one latent variable, without strongly correlated item residuals ('local independence'). Note that Principal Component Analysis and Exploratory Factor Analysis of raw data are exploratory, not confirmatory, methods. |
| Ordered response categories | A higher person location (sum score) on the latent variable should entail an increased probability of a higher response (category) for all items and vice versa. Sometimes referred to as 'monotonicity'. |
| Invariance | Item and measure properties are consistent between relevant demographic groups (gender, age, ethnicity, time, etc). Test-retest correlation is not an invariance test since it does not provide information about item properties. |
| Targeting | Item (threshold) locations compared to person locations should be well matched and not show ceiling or floor effects, or large gaps. |
| Reliability | Sufficient reliability for the expected properties of the target population and intended use of results. Reliability is contingent upon the other criteria being fulfilled and should not be reported for scales with inadequate properties. |
Let’s dive into Item Response Theory and Rasch Measurement Theory!
| q1 | q2 | q3 | q4 | q5 | q6 | q7 | q8 | q9 |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
The assumed basic structure of the data is that there is a systematic pattern across items and participants of an increased probability of correct responses as the latent ability increases. The Rasch model can be described as a probabilistic Guttman scale (Andrich 1985).
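To make the "probabilistic Guttman" idea concrete, here is a minimal base-R sketch (not from the lecture data; person and item locations are simulated) of generating dichotomous responses under the Rasch model, where P(X = 1) = exp(theta - b) / (1 + exp(theta - b)):

```r
# Simulate dichotomous Rasch responses (hypothetical persons and items)
set.seed(1)
n_persons <- 200
theta <- rnorm(n_persons)              # person locations ("ability")
b <- seq(-2, 2, length.out = 9)        # item locations ("difficulty")
p <- plogis(outer(theta, b, "-"))      # Rasch response probabilities
resp <- matrix(rbinom(length(p), 1, p), nrow = n_persons)

# Sorting rows and columns by their sum scores reveals the
# probabilistic Guttman pattern described in the text
resp_sorted <- resp[order(rowSums(resp)), order(colSums(resp), decreasing = TRUE)]
```

Easier items (lower b) get more correct responses on average, but unlike a deterministic Guttman scale, individual responses remain probabilistic.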
This figure shows items and persons sorted based on the number of correct responses (colored blue). You can see the gradual shift from lower left to upper right that shows the Guttman pattern.
This is a key figure in understanding IRT and the concept of item difficulty/location. The x axis is the latent ability (aka latent variable/dimension/continuum), and the y axis is the probability of a correct response.
ICC = item characteristic curve.
The point on the x-axis where the y-axis = 0.5 is the item difficulty (aka item location or item threshold). This is the threshold when the probability of a correct response is equal to the probability of an incorrect response. A correct response on this item indicates a higher ability than the item difficulty, and an incorrect response indicates a lower ability than the item difficulty.
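The ICC just described can be written as a one-line function in base R (a sketch, using `plogis` for the logistic function):

```r
# ICC for a dichotomous Rasch item: P(correct) given
# person location (theta) and item location (b)
icc <- function(theta, b) plogis(theta - b)

icc(1.5, 1.5)  # theta equal to the item location: probability exactly 0.5
icc(3, 1.5)    # theta above the item location: probability > 0.5
```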
By tradition in Rasch/IRT terminology, the term “difficulty” is used to describe the item location on the latent dimension/variable/continuum and “ability” is used to describe the person location on the latent dimension/variable/continuum.
Ability and difficulty are intuitive when describing ability tests, but may be confusing when looking at other types of latent constructs, such as depression or well-being.
A more generic term is the “location” of items and persons on the latent variable. We will move towards using that more consistently in this lecture, but there will be some variation in terminology (sorry!).
This figure illustrates how the items are ordered (“item hierarchy”) according to the locations on the latent dimension/variable where they provide the most information - the item threshold - which is the location on the x axis where probability = 0.5 (on the y axis) for each dichotomous item.
IRT/Rasch uses the logit scale, which is an interval scale, for both items and persons. This means that a difference of one logit represents the same distance regardless of where on the scale you are. Great for statistical analysis!
However, the values on the logit scale have no inherent meaning or external reference point. This is why we need to look at the item difficulty and person ability in relation to each other. Do not conflate the zero point on the logit scale with something like 0 on a Z-score scale.
You also cannot interpret a person location as good or bad, or high or low, without looking at it in the context of other person locations. A person with a location of 0 is not necessarily "average", "normal", or "healthy" (or anything else). It is just a location on the latent variable.
This is another key figure. The points in the bottom part show the locations for each item.
The top histogram shows the distribution on the latent variable for the respondents (aka person locations/abilities).
The middle section aggregates the item thresholds to help visualize how the item locations correspond to the person locations.
This figure is sometimes called a “person-item map” or “Wright map” (Boone and Noltemeyer 2017, 8).
Since items and persons are on the same scale, we can infer a person’s item responses from their latent variable location/score. Let’s say we have a person with Location = 0 as an example.
Which items would this person be most likely to score correctly?
We have looked at the Rasch model for dichotomous data, also sometimes (not quite correctly) referred to as the IRT 1PL model. The key parameter is the item location/difficulty. 1PL stands for “one parameter logistic” model, and the parameter is the item location.
I generally use the term "location" for both items and persons as it is more generic. But for didactic purposes, when speaking of ability tests, it is probably easier to think in terms of the specific term "difficulty" for items and how it relates to a person's latent "ability".
In IRT terminology, “person location” is frequently referred to as “theta”, often using this symbol: \(\theta\)
Let’s return to the items you created before.
The 2PL and 3PL models are also commonly used. The 2PL model adds a second parameter, item discrimination, which describes how well the item separates between persons with high and low ability.
The 3PL model adds a third parameter, which makes the figure look like this.
Can you guess what the third parameter is?
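For reference, here is a base-R sketch of all three response functions (the 3PL's extra parameter is the lower asymptote c, often interpreted as "guessing"):

```r
# 1PL/2PL/3PL item response functions (sketch, base R)
p_1pl <- function(theta, b)       plogis(theta - b)
p_2pl <- function(theta, a, b)    plogis(a * (theta - b))  # a = discrimination
p_3pl <- function(theta, a, b, c) c + (1 - c) * plogis(a * (theta - b))  # c = lower asymptote

p_3pl(-10, a = 1, b = 0, c = 0.25)  # far below the item location, P approaches c
```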
Now let’s move on to questionnaire type of data with ordered response categories.
We’ll focus on the Rasch model, in part since it is complicated enough for this short lecture. But also since it is the only model that allows the ordinal sum score to be used as a sufficient statistic for the latent variable.
This is the same type of figure as before, and it now shows probabilities for all response categories for one item. How do you interpret it?
A polytomous item has multiple thresholds, located where the probability curves for adjacent response categories intersect. The intersections are shown with dashed vertical lines.
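The threshold-as-intersection property can be verified numerically. Below is a sketch of Partial Credit Model category probabilities for a single item (base R; `tau` holds the item's thresholds, categories run from 0 to the number of thresholds):

```r
# PCM category probabilities for one item (sketch)
pcm_probs <- function(theta, tau) {
  num <- exp(c(0, cumsum(theta - tau)))  # unnormalized category terms
  num / sum(num)                         # normalize to probabilities
}

# At theta equal to a threshold, the two adjacent category curves intersect:
pcm_probs(-1, tau = c(-1, 1))  # P(X = 0) equals P(X = 1) at theta = -1
```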
Recall \(\theta\) (person location)?
This is an item from the Perceived Stress Scale (PSS).
The diamond symbol represents the estimated theta value (person location on the latent variable) for a person who responded “often” to this question (and this question only).
Is this person’s theta a high or low value? How does it relate to the overall sample?
q8n: “found that you could not cope with all the things that you had to do?” - “Sometimes”
q2n: “felt that you were unable to control important things in your life?” - “Seldom”
See how the error bar gets smaller and smaller as we add more items? That's because we're getting more and more information about the respondent's theta with each item.
Here are all 7 negatively worded items from the PSS (Rozental, Forsström, and Johansson 2023). Discuss in pairs how you interpret this figure.
Why is this figure of interest?
Test information - reflects item properties, not sample/person properties.
We have been using the Partial Credit Model with Conditional Maximum Likelihood estimation, using the eRm package for R.
When analyzing polytomous data with Rasch/IRT models, the lowest response category is always coded 0, since the score reflects the number of thresholds "passed" by the respondent.
Think of this in relation to the dichotomous model, where 0 and 1 are the only scores available and the sum score is simply a count of items with correct responses. In the polytomous case, the sum score is a count of the thresholds passed per item, summed across items.
We could put any set of items into a Rasch model and just look at ICC curves and targeting and estimate thetas. But that is not much better than “sum score and alpha”.
So let’s put the Rasch model to use in context of the five psychometric criteria mentioned in the beginning.
While multidimensional constructs are possible, they are outside the scope of this lecture. Even measures intended to be unidimensional are often not well constructed; the most common dimensionality problem (in my experience) is residual correlations. We'll look at four ways to assess unidimensionality in a Rasch model:
We’ll use the same published dataset and paper analyzing the PSS-14 scale as before (Rozental, Forsström, and Johansson 2023) as an example for dimensionality analysis.
We use multiple tests since there is no single test to establish unidimensionality. This is also true for CTT analysis.
Does everyone know what residuals are?
"Outfit" refers to item fit when person locations are far from the item location, while "infit" refers to fit when person and item locations are close together. MSQ values should be close to 1, with lower and upper cutoffs often set at 0.7 and 1.3, while ZSTD values should be around 0, with cutoffs at +/- 1.96. Infit is usually the more important statistic. Low fit values indicate a better-than-expected fit to the Rasch model, which can inflate reliability without adding much information; high fit values often reflect multidimensionality.
| OutfitMSQ | InfitMSQ | OutfitZSTD | InfitZSTD | |
|---|---|---|---|---|
| q1n | 0.873 | 0.871 | -1.409 | -1.663 |
| q2n | 0.946 | 0.937 | -0.537 | -0.851 |
| q3n | 0.878 | 0.881 | -1.69 | -1.788 |
| q4p | 0.889 | 0.887 | -1.24 | -1.435 |
| q5p | 0.912 | 0.914 | -1.171 | -1.144 |
| q6p | 0.947 | 0.938 | -0.97 | -0.846 |
| q7p | 0.981 | 0.982 | -0.293 | -0.22 |
| q8n | 0.936 | 0.94 | -0.948 | -0.732 |
| q9p | 1.036 | 1.034 | 0.468 | 0.782 |
| q10p | 1.033 | 1.025 | 0.396 | 0.389 |
| q11n | 0.927 | 0.922 | -0.75 | -1.099 |
| q12n | 0.826 | 0.844 | -1.911 | -1.709 |
| q13p | 1.027 | 1.027 | 0.306 | 0.258 |
| q14n | 1.027 | 1.018 | 0.603 | 0.249 |
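As a small sketch, the conventional MSQ cutoffs can be applied programmatically; the values below are a subset of the InfitMSQ column from the table above:

```r
# Flag items outside the conventional MSQ range [0.7, 1.3]
infit_msq <- c(q1n = 0.871, q3n = 0.881, q9p = 1.034, q12n = 0.844)
flagged <- names(infit_msq)[infit_msq < 0.7 | infit_msq > 1.3]
flagged  # character(0): none of these items exceed the cutoffs
```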
The highest eigenvalue from a principal component analysis of the Rasch model residuals should be below 2.0.
| Eigenvalue |
|---|
| 6.38 |
| 1.43 |
| 0.84 |
| 0.80 |
| 0.69 |
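A minimal sketch of this check in base R. The residual matrix here is simulated noise for illustration (hypothetical); in a real analysis it would be the persons-by-items matrix of standardized residuals from the fitted Rasch model:

```r
# PCA-of-residuals eigenvalue check (sketch, toy residual matrix)
set.seed(1)
res <- matrix(rnorm(200 * 9), nrow = 200)  # 200 persons x 9 items of pure noise
ev <- eigen(cor(res))$values
ev[1] < 2  # unidimensionality criterion: highest eigenvalue below 2.0
```

With pure-noise residuals the criterion is met; a first eigenvalue above 2.0 would suggest a second dimension in the residuals.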
We clearly have two clusters of items. Can you spot the pattern?
It is the relative size of the correlations between items that is important, not the absolute values. Residual correlations should not be more than 0.2 above the mean residual correlation of all item pairs (Christensen, Makransky, and Horton 2017).
| q1n | q2n | q3n | q4p | q5p | q6p | q7p | q8n | q9p | q10p | q11n | q12n | q13p | q14n | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| q1n | ||||||||||||||
| q2n | 0.53 | |||||||||||||
| q3n | 0.4 | 0.53 | ||||||||||||
| q4p | -0.33 | -0.43 | -0.35 | |||||||||||
| q5p | -0.37 | -0.52 | -0.44 | 0.58 | ||||||||||
| q6p | -0.33 | -0.54 | -0.49 | 0.5 | 0.64 | |||||||||
| q7p | -0.4 | -0.52 | -0.44 | 0.39 | 0.46 | 0.51 | ||||||||
| q8n | 0.21 | 0.42 | 0.42 | -0.28 | -0.38 | -0.39 | -0.39 | |||||||
| q9p | -0.47 | -0.46 | -0.4 | 0.42 | 0.42 | 0.38 | 0.39 | -0.32 | ||||||
| q10p | -0.38 | -0.58 | -0.52 | 0.39 | 0.48 | 0.5 | 0.53 | -0.49 | 0.42 | |||||
| q11n | 0.42 | 0.35 | 0.3 | -0.29 | -0.35 | -0.29 | -0.34 | 0.14 | -0.47 | -0.27 | ||||
| q12n | 0.14 | 0.27 | 0.36 | -0.12 | -0.17 | -0.22 | -0.17 | 0.35 | -0.19 | -0.28 | 0.06 | |||
| q13p | -0.2 | -0.45 | -0.36 | 0.27 | 0.36 | 0.35 | 0.31 | -0.46 | 0.24 | 0.49 | -0.17 | -0.3 | ||
| q14n | 0.31 | 0.54 | 0.51 | -0.43 | -0.52 | -0.57 | -0.5 | 0.47 | -0.37 | -0.6 | 0.25 | 0.25 | -0.5 | |
| Note: | ||||||||||||||
| Relative cut-off value (highlighted in red) is 0.172, which is 0.2 above the average correlation. |
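The relative cutoff used in the note above can be computed directly; here is a sketch with a hypothetical 3-item residual correlation matrix:

```r
# Relative cutoff: 0.2 above the mean of all item-pair residual
# correlations (Christensen et al. 2017)
relative_cutoff <- function(rcor) {
  mean(rcor[lower.tri(rcor)]) + 0.2  # use each item pair once
}

# Toy residual correlation matrix (hypothetical values)
rcor <- matrix(c(  1, 0.3, -0.1,
                 0.3,   1, -0.2,
                -0.1, -0.2,   1), nrow = 3)
relative_cutoff(rcor)  # mean of pairs is 0, so cutoff = 0.2
```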
This better illustrates the issue of items being too similar, often referred to as ‘local dependence’.
| itemnr | item |
|---|---|
| flourish1 | I lead a purposeful and meaningful life. |
| flourish2 | My social relationships are supportive and rewarding. |
| flourish3 | I am engaged and interested in my daily activities. |
| flourish4 | I actively contribute to the happiness and well-being of others. |
| flourish5 | I am competent and capable in the activities that are important to me. |
| flourish6 | I am a good person and live a good life. |
| flourish7 | I am optimistic about my future. |
| flourish8 | People respect me. |
| flourish1 | flourish2 | flourish3 | flourish4 | flourish5 | flourish6 | flourish7 | flourish8 | |
|---|---|---|---|---|---|---|---|---|
| flourish1 | ||||||||
| flourish2 | -0.18 | |||||||
| flourish3 | 0.15 | -0.15 | ||||||
| flourish4 | -0.04 | 0.07 | -0.19 | |||||
| flourish5 | -0.09 | -0.32 | 0.06 | -0.21 | ||||
| flourish6 | -0.24 | -0.18 | -0.26 | -0.2 | -0.14 | |||
| flourish7 | -0.22 | -0.2 | -0.15 | -0.24 | -0.13 | 0.05 | ||
| flourish8 | -0.11 | 0.03 | -0.25 | 0.11 | -0.28 | 0.18 | -0.11 | |
| Note: | ||||||||
| Relative cut-off value (highlighted in red) is 0.084, which is 0.2 above the average correlation. |
A higher person location on the latent variable should entail an increased probability of a higher response (category) for all items and vice versa. This is sometimes referred to as ‘monotonicity’.
We can check this by looking at the item characteristic curves (ICC). So far, we have only seen ICCs with ordered response categories. We will look at an example with disordered response categories.
This is by far the most common type of data in psychology and the social sciences. But it is extremely common to pretend that ordinal data is interval data, and to use it as such in everything from simple calculations such as mean/SD to more complex statistical models. This is a problem (Liddell and Kruschke 2018), and there are methods for analyzing ordinal data properly (e.g., Bürkner and Vuorre 2019).
Adding more response categories does not make the data anything other than ordinal, and neither does removing the labels. Visual Analogue Scales have no strong case for being interval level either. All three approaches tend to add problems with disordered response categories and invariance.
We’ll use an open dataset (Didino et al. 2019) for the Flourishing Scale (FS) (Diener et al. 2010), with the same 8 items shown earlier.
All items share the same set of 7 response categories:
| Response | Ordinal |
|---|---|
| Strongly disagree | 0 |
| Disagree | 1 |
| Slightly disagree | 2 |
| Mixed or neither agree nor disagree | 3 |
| Slightly agree | 4 |
| Agree | 5 |
| Strongly agree | 6 |
What we look for in this figure is response categories that at no point on the x-axis have the highest probability. Visually, this means that the problematic probability curve never passes "above" the other categories' curves. Which ones can you identify in this example?
There are several ways to address disordered response categories in the analysis phase. The most common is to merge adjacent categories and rerun the ICC analysis to check results. However, disordered thresholds can (at least partially) be a sign of other problems, such as misfitting items, multidimensionality, or local dependence.
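A recoding sketch in base R: merging two adjacent (hypothetically disordered) categories, here 2 and 3, and renumbering so the categories stay 0..k:

```r
# Merge category 3 into category 2, then close the resulting gap
merge_cats <- function(x) {
  x[x == 3] <- 2L            # collapse 3 into 2
  x[x > 3] <- x[x > 3] - 1L  # shift higher categories down
  x
}

merge_cats(0:6)  # 0 1 2 2 3 4 5
```

After recoding, the ICC analysis is rerun on the merged data to check whether the thresholds are now ordered.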
Long term, it is important to investigate the cause and revise the questionnaire. Usually, disordered categories are related to having too many response categories or to bad or missing category labels (labeling only the endpoints is bad practice). Item formulation also needs to work well with the response categories.
| Item | 2 | 3 | Mean location | StDev | MaxDiff |
|---|---|---|---|---|---|
| q4p | 0.380 | 0.377 | 0.378 | 0.002 | 0.003 |
| q5p | -0.216 | -0.621 | -0.418 | 0.287 | 0.406 |
| q6p | -0.265 | -0.691 | -0.478 | 0.301 | 0.426 |
| q7p | 0.275 | 0.333 | 0.304 | 0.041 | 0.058 |
| q9p | -0.561 | -0.021 | -0.291 | 0.382 | 0.540 |
| q10p | 0.042 | 0.127 | 0.085 | 0.060 | 0.085 |
| q13p | 0.346 | 0.498 | 0.422 | 0.108 | 0.152 |
Item q9p from PSS-14: “been able to control irritations in your life?”
Item threshold locations
Sorry for the Swedish example, this will be remade in English. It illustrates how the item thresholds can differ between two groups, most notably for the second threshold.
Reliability affects statistical power to detect changes/differences in the latent trait. It is a function of the number of items/item thresholds and of the item locations: the more items with more thresholds (ordered response categories), and the more dispersed (rather than overlapping) the item locations, the higher the reliability.
It is important to understand the reliability of the measure itself, and then how it applies to the population of interest. Since reliability is not constant across the latent variable continuum, targeting becomes a factor in determining reliability in a practical use case.
Remember: reliability is a property of the measurement instrument, not the sample.
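The item/threshold argument can be sketched via item information. For a dichotomous Rasch item, information at theta is p * (1 - p), test information is the sum over items, and SEM = 1 / sqrt(information); this is a base-R illustration, not the lecture's estimation code:

```r
# Item information and standard error of measurement (dichotomous Rasch)
item_info <- function(theta, b) { p <- plogis(theta - b); p * (1 - p) }
sem <- function(theta, b) 1 / sqrt(sum(item_info(theta, b)))

sem(0, b = c(-1, 0, 1))         # fewer items: larger SEM
sem(0, b = c(-2, -1, 0, 1, 2))  # more, well-dispersed items: smaller SEM
```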
Recall this figure? The horizontal lines around the diamond shapes get smaller and smaller for each item we add.
When using the Rasch model, we can simply look up the ordinal sum score in a table to get the interval score and its SEM value.
| Ordinal sum score | Logit score | Logit std.error |
|---|---|---|
| 0 | -4.000 | 1.295 |
| 1 | -3.167 | 0.935 |
| 2 | -2.514 | 0.755 |
| 3 | -2.038 | 0.662 |
| 4 | -1.651 | 0.604 |
| 5 | -1.319 | 0.563 |
| 6 | -1.026 | 0.532 |
| 7 | -0.760 | 0.508 |
| 8 | -0.515 | 0.489 |
| 9 | -0.287 | 0.473 |
| 10 | -0.073 | 0.460 |
| 11 | 0.131 | 0.450 |
| 12 | 0.326 | 0.441 |
| 13 | 0.513 | 0.434 |
| 14 | 0.696 | 0.429 |
| 15 | 0.874 | 0.426 |
| 16 | 1.051 | 0.425 |
| 17 | 1.226 | 0.426 |
| 18 | 1.404 | 0.430 |
| 19 | 1.584 | 0.436 |
| 20 | 1.772 | 0.446 |
| 21 | 1.968 | 0.460 |
| 22 | 2.180 | 0.479 |
| 23 | 2.412 | 0.506 |
| 24 | 2.676 | 0.544 |
| 25 | 2.988 | 0.601 |
| 26 | 3.384 | 0.692 |
| 27 | 3.951 | 0.871 |
| 28 | 5.000 | 1.400 |
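The lookup itself is a one-line match in R. This sketch uses a few rows copied from the table above (the function name `to_logit` is mine, not from the lecture code):

```r
# Convert ordinal sum scores to interval (logit) scores via a lookup table
score_table <- data.frame(
  ordinal = c(0, 1, 14, 28),
  logit   = c(-4.000, -3.167, 0.696, 5.000)
)
to_logit <- function(sum_score, tbl) tbl$logit[match(sum_score, tbl$ordinal)]

to_logit(c(1, 14), score_table)  # -3.167  0.696
```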
The location of the respondent in the previous slide is shown using a dashed vertical line. Note the lack of item thresholds in that area.
CTT = factor analysis, principal component analysis, etc; Modern test theory = IRT, Rasch, Mokken, etc
No psychometric/statistical method is “safe” from misuse. Some common mistakes to look for in papers:
Rozental, A., Forsström, D., & Johansson, M. (2023). A psychometric evaluation of the Swedish translation of the Perceived Stress Scale: A Rasch analysis. BMC Psychiatry, 23(1), 690. https://doi.org/10.1186/s12888-023-05162-4
There are guides and tools available for doing Bayesian IRT in R as well. The brms package is highly recommended (Bürkner 2017, 2021) and this excellent blog post: https://solomonkurz.netlify.app/blog/2021-12-29-notes-on-the-bayesian-cumulative-probit/
| Package | Version | Citation |
|---|---|---|
| base | 4.2.3 | R Core Team (2023) |
| catR | 3.17 | Magis and Raîche (2012); Magis and Barrada (2017) |
| doParallel | 1.0.17 | Corporation and Weston (2022) |
| eRm | 1.0.2 | Mair and Hatzinger (2007b); Mair and Hatzinger (2007a); Hatzinger and Rusch (2009); Rusch, Maier, and Hatzinger (2013); Koller, Maier, and Hatzinger (2015); Debelak and Koller (2019); Mair, Hatzinger, and Maier (2021) |
| janitor | 2.2.0 | Firke (2023) |
| lavaan | 0.6.15 | Rosseel (2012) |
| lavaanPlot | 0.6.2 | Lishinski (2021) |
| mirt | 1.40 | Chalmers (2012) |
| patchwork | 1.1.2 | Pedersen (2022) |
| qrcode | 0.2.2 | Onkelinx and Teh (2023) |
| RISEkbmRasch | 0.1.20.1 | Johansson (2023) |
| rmarkdown | 2.22 | Xie, Allaire, and Grolemund (2018); Xie, Dervieux, and Riederer (2020); Allaire et al. (2023) |
| scales | 1.2.1 | Wickham and Seidel (2022) |
| tidyverse | 2.0.0 | Wickham et al. (2019) |